Welcome to a workshop on analysing spatial data! The downstream analyses used in this workshop is in theory applicable to any format of data where we have a matrix of genes by cells, along with their x-y coordinates, whether that’s from technologies such as PhenoCycler, IMC, Xenium, or MERRFISH.

In this workshop, we’ll be re-analysing a head and neck squamous cell carcinoma dataset provided by Angela Ferguson, and published in Clinical Cancer Research.

For some context, head and neck cutaneous squamous cell carcinomas (HNcSCC) are the second most common type of skin cancer. The majority of HNcSCC can be treated with surgery and good local control, but a subset of large tumours infiltrate subcutaneous tissue and are considered high risk for local recurrence and metastases. The key conclusion of this manuscript (amongst others) is that spatial information about cells and the immune environment can be used to predict primary tumour progression or metastases in patients. We will use our spicy workflow to reach a similar conclusion.

We will aim to cover:

  • Introduction to imaging data and how to store it in R

  • How to segment cells from your data in R

  • How to annotate the cells you have segmented out

  • What kind of insights can I gain from my spatial data?

This workshop is suited for a beginner audience, however it is assumed that you have an understanding of the basic science that underpins spatial transcriptomics/proteomics data (i.e I know what a gene/protein is, and why disease might cause a change in gene/protein expression).

This workshop will take you all the way through from images in the form of a TIFF file, all the way to getting nice plots and visualisations from this data.

1 Getting started

1.1 Installing packages

Bioinformatics packages are typically stored on bioconductor. When installing an R package from bioconductor, we need to first install BiocManager (a library that allows us to access bioconductor within R)

# To install biocmanager
install.packages("BiocManager")

I’ve also provided some code for installing all the packages that we will be using in this workshop. This will take some time to install, and you may encounter errors when you try to run this code, depending on what packages you may already have installed.

BiocManager::install(c("cytomapper", "scater", "tidySingleCellExperiment", "SpatialExperiment", "simpleSeg", "FuseSOM", "spicyR", "lisaClust"))

1.1.1 Prerequisite packages

As a brief introduction to the packages, the first packages we will want to install are cytomapper, scater, tidySingleCellExperiment and SpatialExperiment. These are a couple prerequisite packages not developed by our group, but are super useful in helping us visualise and load in our imaging data. dplyr and ggplot2 are standard R packages for dealing with data.

library(cytomapper)
library(scater)
library(tidySingleCellExperiment)
library(SpatialExperiment)

library(dplyr)
library(ggplot2)

theme_set(theme_classic())

1.1.2 Our packages.

In this workshop, we’ll be introducing 4 packages from spicyWorkflow. These are:

  • simpleSeg: For segmenting and normalising your cells

  • FuseSOM: For clustering (and self-annotating) your cells

  • spicyR: For identifying differential localisation of cell types between conditions.

  • lisaClust: For identifying consistent spatial organisation (regions) of cell types.

  • ClassifyR: For identifying consistent spatial organisation (regions) of cell types.

library(simpleSeg)
library(FuseSOM)
library(spicyR)
library(lisaClust)
library(ClassifyR)

1.2 Loading the data

nCores = 40
BPPARAM <- simpleSeg:::generateBPParam(cores = nCores)

First, let’s talk about files. When we’re working with IMC images, we normally have each image as a folder containing a number of TIFF images, each one being an image for a specific marker/channel. We’ll want to specify the directory that contains all of these folders, and then feed that into the loadImages() function.

pathToImages <- "/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/Ferguson_Images"

# Store images in a CytoImageList on_disk as h5 files to save memory.
images <- cytomapper::loadImages(
  pathToImages,
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)

mcols(images) <- S4Vectors::DataFrame(imageID = names(images))

1.2.1 Clean channel names

As we’re reading the image channels directly from the names of the TIFF image, often these channel names will need to be cleaned for ease of downstream processing.

The channel names can be accessed from the CytoImageList object using the channelNames() function.

cn <- channelNames(images) # Read in channel names
head(cn)
## [1] "139La_139La_panCK.ome"      "141Pr_141Pr_CD20.ome"      
## [3] "142Nd_142Nd_HH3.ome"        "143Nd_143Nd_CD45RA.ome"    
## [5] "146Nd_146Nd_CD8a.ome"       "147Sm_147Sm_podoplanin.ome"
cn <- sub(".*_", "", cn) # Remove preceding letters
cn <- sub(".ome", "", cn) # Remove the .ome
head(cn)
## [1] "panCK"      "CD20"       "HH3"        "CD45RA"     "CD8a"      
## [6] "podoplanin"
channelNames(images) <- cn # Reassign channel names

1.2.2 Clean image names

Similarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.

head(names(images))
## [1] "ROI001_ROI 01_F3_SP16-001550_1E" "ROI002_ROI 02_F4_SP16-001550_1E"
## [3] "ROI003_ROI 03_F5_SP16-001550_1E" "ROI005_ROI 05_G4_SP17-002069_1F"
## [5] "ROI006_ROI 06_G5_SP17-002069_1F" "ROI007_ROI 07_G6_SP17-005715_1B"
nam <- sapply(strsplit(names(images), "_"), `[`, 3)
head(nam)
## [1] "F3" "F4" "F5" "G4" "G5" "G6"
names(images) <- nam # Reassigning image names
mcols(images)[["imageID"]] <- nam # Reassigning image names

2 SimpleSeg: Segment (and normalise) your cells!

Our simpleSeg R package on https://github.com/SydneyBioX/simpleSeg provides a series of functions to generate simple segmentation masks of images. These functions leverage the functionality of the EBImage package on Bioconductor. For more flexibility when performing your segmentation in R we recommend learning to use the EBimage package. A key strength of the simpleSeg package is that we have coded multiple ways to perform some simple segmentation operations as well as incorporating multiple automatic procedures to optimise some key parameters when these aren’t specified.

2.1 Run simpleSeg

If your images are stored in a list or CytoImageList they can be segmented with a simple call to simpleSeg(). To summarise, simpleSeg() is an R implementation of a simple segmentation technique which traces out the nuclei using a specified channel using nucleus then dilates around the traced nuclei by a specified amount using discSize. The nucleus can be traced out using either one specified channel, or by using the principal components of all channels most correlated to the specified nuclear channel by setting pca = TRUE.

In the particular example below, we have asked simpleSeg to do the following:

  • By setting nucleus = c("HH3"), we’ve asked simpleSeg to trace out the nuclei signal in the images using the HH3 channel.
  • By setting pca = TRUE, simpleSeg segments out the nuclei mask using a principal component analysis of all channels and using the principal components most aligned with the nuclei channel, in this case, HH3.
  • By setting cellBody = "dilate", simpleSeg uses a dilation strategy of segmentation, expanding out from the nucleus by a specified discSize.
  • By setting discSize = 3, simpleSeg dilates out from the nucleus by 3 pixels.
  • By setting sizeSelection = 20, simpleSeg ensures that only cells with a size greater than 20 pixels will be used.
  • By setting transform = "sqrt", simpleSeg square root transforms each of the channels prior to segmentation.
  • By setting tissue = c("panCK", "CD45", "HH3"), we specify a tissue mask which simpleSeg uses, filtering out all background noise outside the tissue mask. This is important as these are tumour cores, wand hence circular, so we’d want to ignore background noise which happens outside of the tumour core.

There are many other parameters that can be specified in simpleSeg (smooth, watershed, tolerance, and ext), and we encourage the user to select the best parameters which suit their biological context.

masks <- simpleSeg(images,
                   nucleus = c("HH3"),
                   pca = TRUE,
                   cellBody = "dilate",
                   discSize = 3,
                   sizeSelection = 20,
                   transform = "sqrt",
                   tissue = c("panCK", "CD45", "HH3"),
                   cores = nCores
                   )

Try segmenting yourself! I’ve provided a zip folder containing 1 image - i.e. a single folder of 36 channels for you to play around with the parameters and see what you change.

# Code to load in a single image
pathToSingleImage <- "/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/Ferguson_Images/ROI001_ROI 01_F3_SP16-001550_1E"

# Store images in a CytoImageList on_disk as h5 files to save memory.
singleImage <- cytomapper::loadImages(
  pathToSingleImage,
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)

mcols(singleImage) <- S4Vectors::DataFrame(imageID = names(singleImage))

cn <- channelNames(singleImage) # Read in channel names
cn <- sub(".*_", "", cn) # Remove preceding letters
cn <- sub(".ome", "", cn) # Remove the .ome
channelNames(singleImage) <- cn # Reassign channel names

nam <- sapply(strsplit(names(singleImage), "_"), `[`, 3)
names(singleImage) <- nam # Reassigning image names
mcols(singleImage)[["imageID"]] <- nam # Reassigning image names

Play around with these parameters!

# Write your own code!!!
singleMasks <- simpleSeg(singleImage,
                   nucleus = c("HH3"),
                   pca = TRUE,
                   cellBody = "dilate", # Maybe change me?
                   discSize = 3, # Change me!
                   sizeSelection = 20, # Change me!
                   transform = "sqrt", # Change me!
                   tissue = c("panCK", "CD45", "HH3")
                   )

plotPixels(image = singleImage["F3"], 
           mask = singleMasks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3"), 
           display = "single",
           colour = list(HH3 = c("black","blue")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2)
           ))

2.2 Visualise outlines

The plotPixels() function in cytomapper makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed. Here we can see that the segmentation appears to be performing reasonably.

plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3", "CD31", "podoplanin"), 
           display = "single",
           colour = list(HH3 = c("black","blue"),
                         CD31 = c("black", "red"),
                         podoplanin = c("black", "green") ),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2),
             CD31 = c(0, 1, 2),
             podoplanin = c(0, 1, 1.5)
           ))

2.3 Summarise cell features.

In order to characterise the phenotypes of each of the segmented cells, measureObjects() from cytomapper will calculate the average intensity of each channel within each cell as well as a few morphological features. By default, the measureObjects() function will return a SingleCellExperiment object, where the channel intensities are stored in the counts assay and the spatial location of each cell is stored in colData in the m.cx and m.cy columns.

However, you can also specify measureObjects() to return a SpatialExperiment object by specifying return_as = "spe". As a SpatialExperiment object, the spatial location of each cell is stored in the spatialCoords slot, as m.cx and m.cy, which simplifies plotting. In this demonstration, we will return a SpatialExperiment object.

# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
                                    images,
                                    img_id = "imageID",
                                    return_as = "spe",
                                    BPPARAM = BPPARAM)

spatialCoordsNames(cells) <- c("x", "y")

2.4 Load the clinical data

To associate features in our image with disease progression, it is important to read in information which links image identifiers to their progression status. We will do this here, making sure that our imageID match.

clinicalData <- read.csv("/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/clinicalData_TMA1_2021_AF.csv")

rownames(clinicalData) <- clinicalData$imageID
clinicalData <- clinicalData[names(images), ]

# Put clinical data into SingleCellExperiment object
colData(cells) <- cbind(colData(cells), clinicalData[cells$imageID, ])
load("/dskh/nobackup/alexq/spicyWorkflow/vignettes/spe_Ferguson_2022.rda")

3 Normalising your cells

Normalisation is an extremely important step of any workflow, and we should always first visualise our marker intensities to determine of they require some sort of transformation or normalisation.

This reasons for normalisation are two-fold:

  1. The intensities of images are often highly skewed, preventing any meaningful downstream analysis.
  2. The intensities across different images are often different, meaning that what is considered “positive” can be different across images.

Let’s take a look at these effects in our dataset:

# Plot densities of CD3 for each image.
cells |> 
  join_features(features = rownames(cells), shape = "wide", assay = "counts") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

We can transform and normalise our data using the normalizeCells function. In the normalizeCells() function, we specify the following parameters. transformation is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the simpleSeg() function. method = c("trim99", "mean", PC1") is an optional argument which specifies the normalisation method/s to be performed. Here, we: 1) Trim the 99th percentile 2) Divide by the mean 3) Remove the 1st principal component assayIn = "counts" is a required argument which specifies what the assay you’ll be taking the intensity data from is named. In our context, this is called counts.

This modified data is then stored in the norm assay by default. We can see that this normalised data appears more bimodal, not perfectly, but likely to a sufficient degree for clustering, as we can at least observe a clear CD3+ peak at 1.00, and a CD3- peak at around 0.3.

# Leave out the nuclei markers from our normalisation process. 
useMarkers <- rownames(cells)[!rownames(cells) %in% c("DNA1", "DNA2", "HH3")]

# Transform and normalise the marker expression of each cell type.
cells <- normalizeCells(cells,
                        markers = useMarkers,
                        transformation = NULL,
                        method = c("trim99", "mean", "PC1"),
                        assayIn = "counts",
                        cores = nCores
)

# Plot densities of CD3 for each image
cells |> 
  join_features(features = rownames(cells), shape = "wide", assay = "norm") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")

4 FuseSOM: Cluster your cells!

We can appreciate from the UMAP that there is a division of clusters, most likely representing different cell types. We next aim to empirically distinguish each cluster using our FuseSOM package for clustering.

Our FuseSOM R package can be found on bioconductor at https://www.bioconductor.org/packages/release/bioc/html/FuseSOM.html, and provides a pipeline for the clustering of highly multiplexed in situ imaging cytometry assays. This pipeline uses the Self Organising Map architecture coupled with Multiview hierarchical clustering and provides functions for the estimation of the number of clusters.

Here we cluster using the runFuseSOM function. We specify the number of clusters to identify to be numClusters = 10. We also specify a set of cell-type specific markers to use, as we want our clusters to be distinct based off cell type markers, rather than markers which might pick up a transitioning cell state.

4.1 Perform the clustering

# Set seed.
set.seed(51773)

ct_markers <- c("podoplanin", "CD13", "CD31",
                "panCK", "CD3", "CD4", "CD8a",
                "CD20", "CD68", "CD16", "CD14", "CD66a")

# Generate SOM and cluster cells into 10 groups
cells <- runFuseSOM(
  cells,
  markers = ct_markers,
  assay = "norm",
  numClusters = 10
)

We can also observe how reasonable our choice of k = 10 was, using the estimateNumCluster() and optiPlot() functions. Here we examine the Gap method, but others such as Silhouette and Within Cluster Distance are also available.

cells <- estimateNumCluster(cells, kSeq = 2:30)
optiPlot(cells, method = "gap")

4.2 Attempt to interpret the phenotype of each cluster

We can begin the process of understanding what each of these cell clusters are by using the plotGroupedHeatmap function from scater. At the least, here we can see we capture all the major immune populations that we expect to see, including the CD4 and CD8 T cells, the CD20+ B cells, the CD68+ myeloid populations, and the CD66+ granulocytes.

# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
  cells,
  features = ct_markers,
  group = "clusters",
  exprs_values = "norm",
  center = TRUE,
  scale = TRUE,
  zlim = c(-3, 3),
  cluster_rows = FALSE,
  block = "clusters"
)

We can then apply our labels to these cell types.

cells <- cells |>
  mutate(cellType = case_when(
    clusters == "cluster_1" ~ "GC",
    clusters == "cluster_2" ~ "MC",
    clusters == "cluster_3" ~ "OI",
    clusters == "cluster_4" ~ "EP",
    clusters == "cluster_5" ~ "SC",
    clusters == "cluster_6" ~ "Undefined",
    clusters == "cluster_7" ~ "EC",
    clusters == "cluster_8" ~ "BC",
    clusters == "cluster_9" ~ "TC_CD4",
    clusters == "cluster_10" ~ "TC_CD8"
  ))

We might also be interested in how these clusters are distributed on the images themselves. Here we examine the distribution of clusters on image F3.

reducedDim(cells, "spatialCoords") <- spatialCoords(cells)

cells |> 
  filter(imageID == "F3") |> 
  plotReducedDim("spatialCoords", colour_by = "cellType")

4.3 Check cluster frequencies

We find it always useful to check the number of cells in each cluster. Here we can see that we have alot of squamous tumour cells and much fewer dendritic cells.

cells$cellType |>
  table() |>
  sort()
## 
## Undefined        BC        MC        GC    TC_CD8    TC_CD4        EP        OI 
##      2575      3017      6167      7553      8684     11995     12261     12678 
##        EC        SC 
##     14259     76724

Another very popular method of visualising our clusters is using a UMAP. This takes very long to run and will break your R if you try to interrupt it, so probably don’t run this during the workshop.

set.seed(51773)
# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(
  cells,
  subset_row = ct_markers,
  exprs_values = "norm",
  name = "normUMAP"
)

someImages <- unique(cells$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID.
scater::plotReducedDim(
  cells[, cells$imageID %in% someImages],
  dimred = "normUMAP",
  colour_by = "cellType"
)

5 Downstream analyses: spicyR & lisaClust

5.1 Basic proportion analyses

The colTest function allows us to quickly test for associations between the proportions of the cell types and progression status using either Wilcoxon rank sum tests or t-tests. Here we see a p-value less than 0.05, but this does not equate to a small FDR.

# Perform simple student's t-tests on the columns of the proportion matrix.
testProp <- colTest(cells, 
                    condition = "group", 
                    feature = "cellType",
                    type = "ttest")

head(testProp)
##           mean in group NP mean in group P tval.t  pval adjPval   cluster
## TC_CD8               0.057           0.034   2.70 0.011    0.11    TC_CD8
## TC_CD4               0.072           0.096  -2.10 0.044    0.22    TC_CD4
## Undefined            0.019           0.014   1.80 0.084    0.28 Undefined
## MC                   0.038           0.047  -1.60 0.110    0.28        MC
## EC                   0.099           0.085   1.50 0.140    0.28        EC
## BC                   0.016           0.010   0.85 0.400    0.67        BC

Let’s examine one of these clusters using a boxplot.

prop <- getProp(cells, feature = "cellType")
clusterToUse <- rownames(testProp)[1]

prop |>
  select(all_of(clusterToUse)) |>
  tibble::rownames_to_column("imageID") |>
  left_join(as.data.frame(colData(cells)), by = "imageID") |>
  ggplot(aes(x = group, y = .data[[clusterToUse]], fill = group)) +
  geom_boxplot()

5.2 spicyR: Cell Type Localisations

Our spicyR package is available on bioconductor on https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html and provides a series of functions to aid in the analysis of both immunofluorescence and imaging mass cytometry data as well as other assays that can deeply phenotype individual cells and their spatial location. Here we use the spicy function to test for changes in the spatial relationships between pair-wise combinations of cells.

Put simply, spicyR uses the L-function to determine localisation or dispersion between cell types. The L-function is an arbitrary measure of “closeness” between points, with greater values suggesting increased localisation, and lower values suggesting dispersion.

Here, we quantify spatial relationships using a combination of three radii Rs = c(20, 50, 100) and mildly account for some global tissue structure using sigma = 50. Further information on how to optimise these parameters can be found in the vignette and the spicyR paper.

spicyTest <- spicy(cells,
                   condition = "group",
                   cellTypeCol = "cellType",
                   imageIDCol = "imageID",
                   Rs = 1:10*10,
                   sigma = 50,
                   BPPARAM = BPPARAM)

topPairs(spicyTest, n = 10)
##                        intercept coefficient    p.value adj.pvalue      from
## TC_CD4__EP            -17.291697    64.70955 0.02857272  0.6306348    TC_CD4
## TC_CD8__OI              2.520092    54.47818 0.02911094  0.6306348    TC_CD8
## EP__TC_CD4            -17.170495    63.77083 0.02976289  0.6306348        EP
## MC__Undefined         -45.724067    70.93448 0.03858075  0.6306348        MC
## EP__BC               -194.290564   192.49532 0.04171342  0.6306348        EP
## BC__EP               -186.126674   192.01456 0.04293275  0.6306348        BC
## Undefined__MC         -44.945367    66.67640 0.04679223  0.6306348 Undefined
## OI__TC_CD8              4.123409    50.11153 0.05211892  0.6306348        OI
## Undefined__Undefined  357.438334  -122.59698 0.05675713  0.6306348 Undefined
## OI__EP                -16.606006    51.06253 0.08634602  0.7696853        OI
##                             to
## TC_CD4__EP                  EP
## TC_CD8__OI                  OI
## EP__TC_CD4              TC_CD4
## MC__Undefined        Undefined
## EP__BC                      BC
## BC__EP                      EP
## Undefined__MC               MC
## OI__TC_CD8              TC_CD8
## Undefined__Undefined Undefined
## OI__EP                      EP

We can visualise these tests using signifPlot where we observe that cell type pairs appear to become less attractive (or avoid more) in the progression sample.

# Visualise which relationships are changing the most.
signifPlot(
  spicyTest,
  breaks = c(-1.5, 1.5, 0.5)
)

spicyR also has functionality for plotting out individual pairwise relationships. We can first try look into whether the major tumour cell type localises with the major myeloid cell type, and whether this localisation affects progression vs non-progression of the tumour.

spicyBoxPlot(spicyTest, 
             from = "SC", 
             to = "GC")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

5.3 lisaClust: Determining Regions

Our lisaClust package https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html provides a series of functions to identify and visualise regions of tissue where spatial associations between cell-types is similar. This package can be used to provide a high-level summary of cell-type co-localisation in multiplexed imaging data that has been segmented at a single-cell resolution. Here we use the lisaClust function to clusters cells into 5 regions with distinct spatial ordering.

set.seed(51773)

# Cluster cells into spatial regions with similar composition.
cells <- lisaClust(
  cells,
  k = 4,
  sigma = 50,
  cellType = "cellType",
  BPPARAM = BPPARAM
)
## Generating local L-curves. If you run out of memory, try 'fast = FALSE'.

5.3.1 Region - cell type enrichment heatmap

We can try to interpret which spatial orderings the regions are quantifying using the regionMap function. This plots the frequency of each cell type in a region relative to what you would expect by chance.

# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "cellType", limit = c(0, 2))

5.3.2 Visualise regions

By default, these identified regions are stored in the regions column in the colData of our object. We can quickly examine the spatial arrangement of these regions using ggplot.

cells |> 
  filter(imageID == "F3") |> 
  plotReducedDim("spatialCoords", colour_by = "region")

While much slower, we have also implemented a function for overlaying the region information as a hatching pattern so that the information can be viewed simultaneously with the cell type calls.

# Use hatching to visualise regions and cell types.
hatchingPlot(
  cells,
  useImages = "F3",
  cellType = "cellType",
  nbp = 300
)
## Concave windows are temperamental. Try choosing values of window.length > and < 1 if you have problems.

5.3.3 Test for association with progression

If needed, we can again quickly use the colTest function to test for associations between the proportions of the cells in each region and progression status using either Wilcoxon rank sum tests or t-tests.

# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(
  cells,
  feature = "region",
  condition = "group",
  type = "ttest"
)

testRegion
##          mean in group NP mean in group P tval.t pval adjPval  cluster
## region_2            0.038           0.022   1.50 0.15    0.60 region_2
## region_1            0.200           0.220  -0.63 0.53    0.77 region_1
## region_4            0.150           0.140   0.46 0.65    0.77 region_4
## region_3            0.610           0.620  -0.29 0.77    0.77 region_3
reg <- getProp(cells, feature = "region")
regionToUse <- rownames(testRegion)[1]

reg |>
  select(all_of(regionToUse)) |>
  tibble::rownames_to_column("imageID") |>
  left_join(as.data.frame(colData(cells)), by = "imageID") |>
  ggplot(aes(x = group, y = .data[[regionToUse]], fill = group)) +
  geom_boxplot()

6 ClassifyR: Predicting Progressor vs Non-Progressor status

# Create list to store data.frames
data <- list()

# Add proportions of each cell type in each image
data[["Proportions"]] <- getProp(cells, "cellType")

# Add pair-wise associations
spicyMat <- bind(spicyTest)
spicyMat[is.na(spicyMat)] <- 0
spicyMat <- spicyMat |>
  select(!condition) |>
  tibble::column_to_rownames("imageID")

data[["SpicyR"]] <- spicyMat

# Add proportions of each region in each image
# to the list of dataframes.
data[["LisaClust"]] <- getProp(cells, "region")
# Set seed
set.seed(51773)

# Preparing outcome vector
outcome <- cells$group[!duplicated(cells$imageID)]
names(outcome) <- cells$imageID[!duplicated(cells$imageID)]

idx <- names(sample(outcome[outcome == "NP"], 14))
outcome <- outcome[!names(outcome) %in% idx]

data <- lapply(data, function(x) x[!rownames(x) %in% idx,])

# Perform cross-validation of a random forest model
# with 100 repeats of 5-fold cross-validation.
cv <- crossValidate(
  measurements = data,
  outcome = outcome,
  classifier = "randomForest",
  nFolds = 5,
  nRepeats = 50,
  nCores = nCores
)

6.1 Visualise cross-validated prediction performance

Here we use the performancePlot function to assess the AUC from each repeat of the 5-fold cross-validation. We see that the lisaClust regions appear to capture information which is predictive of progression status of the patients.

# Calculate AUC for each cross-validation repeat and plot.
performancePlot(
  cv,
  metric = "AUC",
  characteristicsList = list(x = "Assay Name")
)
## Warning in .local(results, ...): AUC not found in all elements of results.
## Calculating it now.

library(grid)
samplesMetricMap(cv)
## Warning in .local(results, ...): Sample Accuracy not found in all elements of
## results. Calculating it now.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_tile()`).

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (2-2,1-1) arrange       gtable[layout]
## 2 2 (1-1,1-1) arrange text[GRID.text.1663]
set.seed(51773)

cv <- crossValidate(
  measurements = data,
  outcome = outcome,
  classifier = "randomForest",
  nFolds = 5,
  nRepeats = 50,
  multiViewMethod = "merge",
  nCores = nCores
)

performancePlot(
  cv,
  metric = "AUC",
  characteristicsList = list(x = "Assay Name")
)
## Warning in .local(results, ...): AUC not found in all elements of results.
## Calculating it now.
library(grid)
samplesMetricMap(cv)
## Warning in .local(results, ...): Sample Accuracy not found in all elements of
## results. Calculating it now.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_tile()`).

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (2-2,1-1) arrange       gtable[layout]
## 2 2 (1-1,1-1) arrange text[GRID.text.2020]